Detection of multivariate outliers in business survey data with incomplete information

Authors

  • Valentin Todorov
  • Matthias Templ
  • Peter Filzmoser
Abstract

Many different methods for statistical data editing can be found in the literature, but only a few of them are based on robust estimates (for example, the BACON-EEM, epidemic algorithm (EA) and transformed rank correlation (TRC) methods of Béguin and Hulliger). However, we can show that outlier detection is only reasonable if robust methods are applied, because the classical estimates are themselves influenced by the outliers. Nevertheless, data editing is essential to check the multivariate data for possible data problems, and it is not deterministic like traditional micro editing, where all records are extensively edited manually using certain rules/constraints. The presence of missing values is the rule rather than the exception in business surveys and poses additional severe challenges to outlier detection. First we review the available multivariate outlier detection methods which can cope with incomplete data. In a simulation study, where a subset of the Austrian Structural Business Statistics is simulated, we compare several approaches. Robust methods based on the Minimum Covariance Determinant (MCD) estimator, S-estimators and the OGK estimator, as well as BACON-BEM, provide the best results in finding the outliers and in providing a low false discovery rate. Many of the discussed methods are implemented in the R package rrcovNA, which is available from the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org under the GNU General Public License.
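The abstract's claim that classical estimates are themselves influenced by the outliers can be illustrated with a small sketch. This is not one of the paper's methods and does not use rrcovNA: it is a one-dimensional analogue, written in Python with made-up data, that compares scores based on the mean and standard deviation with scores based on the median and the scaled median absolute deviation (MAD). The function names, the data and the cutoff of 3 are all assumptions chosen for illustration.

```python
import statistics


def classical_scores(values):
    """Distance of each value from the mean, in standard deviations."""
    center = statistics.mean(values)
    spread = statistics.stdev(values)
    return [abs(v - center) / spread for v in values]


def robust_scores(values):
    """Distance of each value from the median, in scaled MAD units."""
    center = statistics.median(values)
    mad = statistics.median([abs(v - center) for v in values])
    spread = 1.4826 * mad  # makes the MAD consistent with the SD under normality
    return [abs(v - center) / spread for v in values]


def flag(scores, cutoff=3.0):
    """Indices whose score exceeds the (assumed) cutoff."""
    return [i for i, s in enumerate(scores) if s > cutoff]


# Hypothetical data: ten ordinary values and two gross outliers at the end.
data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 100, 105]

classical_flags = flag(classical_scores(data))
robust_flags = flag(robust_scores(data))

print("classical:", classical_flags)
print("robust:   ", robust_flags)
```

In this toy data set the two large values inflate the mean and standard deviation so much that neither exceeds the cutoff (the masking effect), while the median and MAD are barely affected and both values are flagged. The multivariate estimators named in the abstract (MCD, S, OGK) apply the same idea to location and covariance.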


Similar resources

Identification of outliers types in multivariate time series using genetic algorithm

Multivariate time series data are often modeled using a vector autoregressive moving average (VARMA) model. But the presence of outliers can violate the stationarity assumption and may lead to wrong modeling, biased estimation of parameters and inaccurate prediction. Thus, detection of these points and how to deal properly with them, especially in relation to modeling and parameter estimation of VARMA m...


Multivariate Outlier Detection and Treatment in Business Surveys

Multivariate outlier detection based on the Mahalanobis distance with the BACON-EEM algorithm, the TRC algorithm and the ER algorithm is presented and imputation of outliers and further missing values is discussed. The methods are illustrated with a data set on Swedish municipalities. The relation between outliers, influential observations and selective editing is explored. Finally robust multi...


Detecting multivariate outliers using projection pursuit with particle swarm optimization

Detecting outliers in the context of multivariate data is known as an important but difficult task and there already exist several detection methods. Most of the proposed methods are based either on the Mahalanobis distance of the observations to the center of the distribution or on a projection pursuit (PP) approach. In the present paper we focus on the one-dimensional PP approach which may be...


Investigating Outliers Detection Methods for the Iranian Manufacturing Establishment Survey Data

The role and importance of the industrial sector in economic development make accurate and timely data necessary for precise planning. Because outliers are common in establishment survey data due to the structure of the economy, survey data must be evaluated by identifying and investigating outliers before release. In this paper the practical applicatio...


Local multivariate outliers as geochemical anomaly halos indicators, a case study: Hamich area, Southern Khorasan, Iran

Anomaly recognition has always been a prominent subject in preliminary geochemical exploration. Regional geochemical data processing draws on a range of statistical and data mining techniques as well as different mapping methods, which serve to present the outputs. Outlier values are of interest in investigations where data are gathered under controlled condition...



Journal title:
  • Adv. Data Analysis and Classification

Volume 5

Publication date: 2011